Abstract

When a malleable job is submitted to a space-sharing parallel computer, it must often choose
whether to begin execution on a small, available cluster or wait in queue for more
processors  to  become  available.  To  make  this  decision,  it  must  predict  how  long  it  will  have 
to  wait  for  the  larger  cluster.  We  propose  statistical  techniques  for  predicting  these  queue 
times,  and  develop  an  allocation  strategy  that  uses  these  predictions.  We  present  a  workload 
model  based  on  the  environment  we  have  observed  at  the  San  Diego  Supercomputer  Center, 
and  use  this  model  to  drive  simulations  of  various  allocation  strategies.  We  conclude  that 
prediction-based  allocation  not  only  improves  the  average  turnaround  time  for  the  jobs;  it 
also  improves  the  utilization  of  the  system  as  a  whole. 

Keywords: parallel, space-sharing, partitioning, scheduling, allocation, malleable,
multiprocessor, prediction, queue.


1  Introduction 

Like  many  shared  resources,  parallel  computers  are  susceptible  to  a  tragedy  of  the  commons  — 
individuals  acting  in  their  own  interests  tend  to  overuse  and  degrade  the  resource.  Specifically, 
users  trying  to  minimize  runtimes  for  their  jobs  might  allocate  more  processors  than  they  can  use 
efficiently.  This  overindulgence  lowers  the  effective  use  of  the  system  and  increases  the  queue  times 
of  all  users’  jobs. 

A partial solution to this problem is a programming model that supports malleable jobs, i.e.,
jobs  that  can  be  configured  to  run  on  clusters  of  various  sizes.  In  existing  systems,  these  jobs 
cannot  change  their  cluster  sizes  dynamically;  that  is,  once  they  begin  execution,  they  cannot  add 
or  yield  processors. 

Malleable  jobs  improve  system  utilization  by  using  fewer  processors  when  system  load  is  high, 
thereby  running  more  efficiently  and  increasing  the  number  of  jobs  in  the  system  simultaneously. 
But  malleable  jobs  are  not  a  sufficient  solution  to  the  tragedy  of  the  commons,  because  users  have 
no  direct  incentive  to  restrict  the  cluster  sizes  of  their  jobs.  Furthermore,  even  altruistic  users 
might  not  have  the  information  they  need  to  make  the  best  decisions. 

One solution to this problem is a system-centric scheduler that chooses cluster sizes
automatically, trying to optimize (usually heuristically) a system-wide performance metric like utilization
or average turnaround time. The problem is that such systems often force users to accept decisions
that are good for the system as a whole, but contrary to their immediate interests. For example,
if there is a job in queue and one idle processor, a utilization-maximizing system might require the
job to run, whereas the job might obtain a shorter turnaround time by waiting for more processors.

*EECS - Computer Science Division, University of California, Berkeley, CA 94720 and San Diego Supercomputer
Center, P.O. Box 85608, San Diego, CA 92186. Supported by NSF grant ASC-89-02825 and by Advanced Research
Projects Agency/ITO, Distributed Object Computation Testbed, ARPA order No. D570, Issued by ESC/ENS
under contract #F19628-96-C-0020. email: downey@sdsc.edu, http://www.sdsc.edu/~downey

If  such  strategies  are  implemented,  there  are  two  possible  outcomes:  at  best,  users  will  be 
unsatisfied  with  the  system;  at  worst,  they  will  take  steps  to  subvert  it.  Since  these  systems  often 
rely  on  application  information  provided  by  users,  it  is  not  hard  for  a  disgruntled  user  to  manipulate 
the  system  for  his  own  benefit.  In  anecdotal  reports  from  supercomputer  centers,  this  sort  of  user 
behavior is common, and not restricted to malevolent users; rather, it is understood in these
environments  that  users  will  take  advantage  of  loopholes  in  system  policies. 

Given  that  this  is  true,  an  important  property  for  a  scheduling  strategy  is  robustness  in  the 
presence  of  self-interested  users.  As  we  will  show  in  Section  5,  many  commonly-proposed  allocation 
strategies  do  not  have  this  property;  their  overall  performance  degrades  severely  if  users  try,  naively, 
to  improve  the  performance  of  individual  jobs. 

Thus  our  goal  is  to  find  a  scheduling  strategy  that  does  not  make  decisions  that  are  contrary 
to the interests of the users. We propose an application-centric scheduler that uses application
information  (run  times  on  various  cluster  sizes)  and  system  state  (predicted  queue  times)  to  choose 
the  cluster  size  with  the  shortest  expected  turnaround  time  for  each  job. 

This  strategy  optimizes  the  performance  of  individual  jobs,  so  users  have  no  incentive  to  subvert 
its  decisions.  The  question,  though,  is  what  effect  these  local  optimizations  will  have  on  the 
performance  of  the  system  as  a  whole.  Using  simulations  based  on  a  stochastic  workload  model, 
we  show  that  the  performance  of  one  such  strategy  exceeds  that  of  the  best  system-centric  scheduler, 
improving  both  system  utilization  and  average  turnaround  time. 

1.1  Queueing  strategy 

A scheduling strategy consists of a queueing strategy, which chooses which job in queue begins
execution, and an allocation strategy, which chooses how many processors are allocated to each
job. 

In  this  paper,  we  will  be  using  only  first-come-first-served  (FCFS)  queueing  strategies.  Thus, 
once  a  job  arrives  at  the  head  of  the  queue,  its  remaining  queue  time  does  not  depend  on  the  other 
jobs  in  queue  or  on  future  arrivals.  Furthermore,  if  the  job  at  the  head  of  the  queue  decides  to 
hold  out  —  to  leave  processors  idle  until  a  larger  cluster  is  available  —  the  other  jobs  in  queue  are 
not  permitted  to  pass  it  by. 

In  real  systems,  there  are  often  several  queues  with  different  priorities.  The  queueing  strategy 
within  each  queue  is  FCFS,  but  jobs  in  different  queues  do  not  necessarily  run  in  the  order  they 
arrive.  The  techniques  we  present  here  can  be  extended  to  model  this  environment,  although  we 
expect  it  to  be  more  difficult  to  predict  the  behavior  of  low-priority  queues,  since  they  are  affected 
by  higher-priority  arrivals. 

Some  prior  studies  have  proposed  more  general  non-FCFS  queueing  strategies  [11]  [4].  These 
strategies  have  the  potential  to  reduce  turnaround  times  by  identifying  short  jobs  (one  way  or 
another)  and  giving  them  priority.  Furthermore,  a  non-FCFS  strategy  would  discourage  jobs  from 
holding  out  for  more  processors,  since  a  stubborn  job  would  risk  starvation.  Thus  the  immediate 
effect  might  be  to  reduce  the  number  of  idle  processors.  Despite  these  advantages,  it  is  not  clear 
that  non-FCFS  strategies  will  improve  system  performance.  If  jobs  in  queue  are  forced  to  compete 
for  idle  processors,  we  expect  the  result  to  be  similar  to  the  ASP  strategy  we  examined  in  [5],  in 
which  idle  processors  are  divided  evenly  among  the  jobs  in  queue.  In  that  study,  we  found  that 
ASP was significantly worse than the FCFS strategies. FCFS strategies have one other advantage,
which is that they are more predictable. This property has a direct impact on user satisfaction, and
also lends itself to metacomputing environments in which users (or system agents) select among
various resources according to the predicted time until they are available (among other things).

Thus, although we consider non-FCFS queueing strategies an interesting area for further
exploration, we have not included them in this study.

1.2  Related  work 

In most existing systems, users choose cluster sizes for their jobs by hand, and the system does not
have the option of allocating more or fewer than the requested number of processors. Thus, there
have been no reports yet on the effectiveness of adaptive allocation strategies in real systems.

Many simulation and analytic studies have examined the performance of system-centric
allocation strategies, i.e., strategies designed to maximize an aggregate performance metric, without
regard for individual jobs. In most cases, this metric is average turnaround time [12] [21] [19] [20]
[11] [13] [4] [18], although some studies also consider throughput [16]. Rosti, Smirni et al. [17]
[22] use power, which is the ratio of throughput to mean response time. In [6] we used a parallel
extension of slowdown, which is the ratio of the actual response time of a job to the time it would
have taken on a dedicated machine.

Recently, there has been interest in application-centric scheduling, in which individual jobs
select resources in order to minimize their own run times, without regard for the performance of
the system as a whole. The AppLeS project at the University of California at San Diego [2] [3] and
the MARS project at the Paderborn Center for Parallel Computing [10] are applying this approach
in a wide-area, heterogeneous environment.

Atallah et al. [1] have proposed system agents that choose cluster sizes (and specific sets of
hosts) to minimize the turnaround time of individual applications. Since their target system is
a timesharing network of workstations, they consider the problem of contention with other jobs
in the system, but they do not have the problem of predicting the time until a cluster becomes
available.

1.3  Outline 

Section 2 describes an abstract workload model we have derived from observations at the San Diego
Supercomputer Center. Section 3 describes the statistical techniques we use to predict queue times.
Section 4 describes the simulator we use to evaluate various allocation strategies. Section 5 presents
our evaluation of these strategies from the application’s point of view, and Section 6 discusses the
effect these strategies have on the system as a whole.

2  Workload  model 

In order to evaluate the proposed allocation strategies, we will use a simulation based on an
abstract workload model. On existing systems, we often collect statistics about actual (concrete)
workloads; for example, we might know the duration and cluster size of each job. These measured
workloads depend on the characteristics of the job mix, the properties of the system hardware, and
the behavior of the allocation strategy. Thus, it may not be correct to use a concrete workload
from one system to simulate, and evaluate, another. Our goal is to create an abstract workload
that separates the effect of the job mix from the effect of the system.

In [5] we propose a model of malleable jobs that characterizes each job by three parameters:
L, the sequential lifetime of the job; A, the average parallelism; and $\sigma$, which measures the job’s
variance in parallelism. Using this model we can calculate the speedup and run time of a job on
any number of processors. Section 2.1 summarizes this application model.

Once  we  have  a  model  of  individual  applications,  we  can  construct  a  workload  model  that 
describes  the  system  load,  the  arrival  process,  and  the  distribution  of  application  parameters.  In 
[6]  we  presented  observations  of  scientific  workloads  on  the  Intel  Paragon  at  SDSC  and  the  IBM 
SP2  at  CTC.  We  summarize  those  observations  in  Section  2.2,  and  use  them  to  derive  our  abstract 
workload  model. 

2.1  A  model  of  malleable  jobs 

Our model of parallel speedup is based on a family of curves that are parameterized by a job’s
average parallelism, A, and its variance in parallelism, V. To do this, we construct a hypothetical
parallelism profile¹ with the desired values of A and V, and then use this profile to derive speedups.
We use two families of profiles, one for programs with low V, the other for programs with high V.
In [5] we show that this family of speedup profiles captures, at least approximately, the behavior
of a variety of parallel scientific applications on a variety of architectures.

2.1.1  Low  variance  model 

Figure 1a shows a hypothetical parallelism profile for an application with low variance in
parallelism. The potential parallelism is A for all but some fraction $\sigma$ of the duration ($0 \le \sigma \le 1$). The
remaining time is divided between a sequential component and a high-parallelism component. The
average parallelism of this profile is A; the variance of parallelism is $V = \sigma(A - 1)^2$.

A  program  with  this  profile  would  have  the  following  speedup  as  a  function  of  the  cluster  size 
n: 


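$$
S(n) =
\begin{cases}
\dfrac{An}{A + \sigma(n - 1)/2} & 1 \le n \le A \\[1ex]
\dfrac{An}{\sigma(A - 1/2) + n(1 - \sigma/2)} & A \le n \le 2A - 1 \\[1ex]
A & n \ge 2A - 1
\end{cases}
\tag{1}
$$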


2.1.2  High  variance  model 

Figure 1b shows a hypothetical parallelism profile for an application with high variance in
parallelism. This profile has a sequential component of duration $\sigma$ and a parallel component of duration
1 and potential parallelism $A + A\sigma - \sigma$. By design, the average parallelism is A; by good fortune,
the variance of parallelism is $\sigma(A - 1)^2$, the same as that of the low-variance model. Thus for both
models $\sigma$ is approximately the square of the coefficient of variation, $CV^2$. This approximation
follows from the definition of coefficient of variation, $CV = \sqrt{V}/A$. Thus, $CV^2$ is $\sigma(A - 1)^2/A^2$,
which for large A is approximately $\sigma$.

A  program  with  this  profile  would  have  the  following  speedup  as  a  function  of  cluster  size: 


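$$
S(n) =
\begin{cases}
\dfrac{nA(\sigma + 1)}{A + A\sigma - \sigma + n\sigma} & 1 \le n \le A + A\sigma - \sigma \\[1ex]
A & n \ge A + A\sigma - \sigma
\end{cases}
\tag{2}
$$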
¹ The parallelism profile is the distribution of potential parallelism during the execution of a program [21].

Figure 1: The hypothetical parallelism profiles we use to derive our speedup model: (a) low-variance model, (b) high-variance model.

Figure 2: Speedup curves for a range of values of $\sigma$.


Figure 2 shows a set of speedup curves for a range of values of $\sigma$ (and A = 64). When $\sigma = 0$ the
curve matches the theoretical upper bound for speedup, bounded at first by the “hardware limit”
(linear speedup) and then by the “software limit” (the average parallelism A). As $\sigma$ approaches
infinity, the curve approaches the theoretical lower bound on speedup derived by Eager et al. [8]:

$$
S_{min}(n) = \frac{An}{A + n - 1} \tag{3}
$$

When $\sigma = 1$ the two models (high and low variance) are identical.
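
As a minimal sketch, Equations 1 and 2 can be written as two small Python functions. The names `speedup_low`, `speedup_high`, `speedup`, and `run_time` are illustrative only, and switching between the two models at σ = 1 is an assumption based on the statement that they coincide there.

```python
def speedup_low(n, A, sigma):
    """Equation 1: low-variance model (intended for 0 <= sigma <= 1)."""
    if n <= A:
        return A * n / (A + sigma * (n - 1) / 2)
    if n <= 2 * A - 1:
        return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
    return float(A)


def speedup_high(n, A, sigma):
    """Equation 2: high-variance model (intended for sigma >= 1)."""
    knee = A + A * sigma - sigma  # potential parallelism of the parallel phase
    if n <= knee:
        return n * A * (sigma + 1) / (A + A * sigma - sigma + n * sigma)
    return float(A)


def speedup(n, A, sigma):
    """Speedup on n processors; the model is chosen by sigma (assumed cutoff: 1)."""
    if n < 1 or A < 1 or sigma < 0:
        raise ValueError("expected n >= 1, A >= 1, sigma >= 0")
    if sigma <= 1:
        return speedup_low(n, A, sigma)
    return speedup_high(n, A, sigma)


def run_time(L, n, A, sigma):
    """Run time of a job with sequential lifetime L on n processors."""
    return L / speedup(n, A, sigma)


if __name__ == "__main__":
    # Tabulate a few points of the curves for A = 64.
    for sigma in (0.0, 0.5, 1.0, 2.0):
        row = [round(speedup(n, 64, sigma), 1) for n in (1, 16, 64, 128)]
        print(sigma, row)
```

With σ = 0 the function reproduces the upper bound (linear speedup up to A, then flat), and for very large σ the high-variance branch approaches the lower bound of Equation 3.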

2.2  Distribution  of  parameters 

With the speedup model in the previous section, we can use L, A, and $\sigma$ to calculate the run time of
a job on any number of processors: the speedup on n processors is $S(n, A, \sigma)$; the run time is $L/S$.
Of  course,  for  many  jobs  there  will  be  ranges  of  n  where  this  model  is  inapplicable.  For  example, 
a  job  with  large  memory  requirements  will  run  poorly  (or  not  at  all)  when  n  is  small.  Also,  when 
n  is  large,  speedup  may  decrease  as  communication  overhead  overwhelms  computational  speedup. 
Thus  we  qualify  our  application  model  with  the  understanding  that  for  each  application  there  may 
be  a  limited  range  of  viable  cluster  sizes. 

2.2.1  Lifetimes 

Ideally, we would like to know the distribution of L, the sequential lifetime, for a real workload.
But the accounting data we have from real systems does not contain sequential lifetimes; since
most jobs run on more than one processor, we do not know what their sequential lifetimes would
be. Furthermore, sequential lifetime is not always defined, since many jobs cannot run on a single
processor because of memory requirements. Thus, L is really an abstract measurement of the total
work a job does, rather than a literal measurement of its run time on one processor.

The accounting data we do have is the total allocated time, T, which is the product of wall clock
lifetime and cluster size. For programs with linear speedup, T equals L, but for programs with
sublinear speedups, T can be much larger than L. Nevertheless, we assume that the distributions
of L and T have the same shape, but different parameters. In our simulations, this is true: the
distribution of L (which is an input to the simulator) and the distribution of T (which depends on
the application characteristics and scheduling policy) have the same shape, although the values of
T are higher due to non-linear speedups.

Figure 3: Distribution of total allocated time (wall clock time multiplied by number of processors)
for 24907 batch jobs from the Intel Paragon at the San Diego Supercomputer Center (SDSC). The
gray line shows the model used to summarize the distribution.

This  result  allows  us  to  use  observed  distributions  of  T  to  construct  the  distribution  of  L  for 
our workload. We have examined accounting logs from the Intel Paragon at the San Diego
Supercomputer Center (SDSC); Figure 3 shows the distribution of total allocated time for 24907 jobs
that ran on this machine during a nine-month period in 1995. The distribution is approximately
linear in log space, which implies that the cumulative distribution function (cdf) has the form:

$$
\mathrm{cdf}_T(t) = \Pr\{T \le t\} = \beta_0 + \beta_1 \ln t, \qquad t_{min} \le t \le t_{max} \tag{4}
$$

where $\beta_0$ and $\beta_1$ are the intercept and slope of the observed line. The lower and upper bounds of
this distribution are $t_{min} = e^{-\beta_0/\beta_1}$ and $t_{max} = e^{(1 - \beta_0)/\beta_1}$, the points at which the cdf equals 0 and 1.

Since this distribution is uniform in log space, we call it a uniform-log distribution. We know
of  no  theoretical  reason  that  the  distribution  should  have  this  shape,  but  we  believe  that  it  is 
pervasive  among  batch  workloads,  since  we  have  observed  similar  distributions  on  the  IBM  SP2  at 
the  Cornell  Theory  Center  and  the  Cray  C90  at  SDSC,  and  other  authors  have  reported  similar 
distributions  on  other  systems  [9]  [23]. 

Since T overestimates the sequential lifetimes of jobs, our workload model uses a distribution
of L with a somewhat lower maximum than the distributions we observed. In our simulations, L is
chosen from a uniform-log distribution with minimum $e^{2}$ and maximum $e^{12}$ seconds. The median
of this distribution is 18 minutes; the mean is 271 minutes.
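
The uniform-log distribution is easy to work with numerically. The sketch below assumes bounds of e^2 and e^12 seconds, values used here because they reproduce the stated median (18 minutes) and mean (271 minutes); the function names are illustrative.

```python
import math
import random

# Assumed bounds of the uniform-log lifetime distribution, as natural logs of seconds.
LN_MIN, LN_MAX = 2.0, 12.0


def cdf(t):
    """Pr{L <= t} = beta0 + beta1 * ln t, clamped to [0, 1] outside the bounds."""
    beta1 = 1.0 / (LN_MAX - LN_MIN)   # slope in log space
    beta0 = -LN_MIN * beta1           # intercept, so that cdf(t_min) = 0
    if t <= math.exp(LN_MIN):
        return 0.0
    return min(1.0, beta0 + beta1 * math.log(t))


def quantile(p):
    """Inverse cdf: the lifetime t with Pr{L <= t} = p."""
    return math.exp(LN_MIN + p * (LN_MAX - LN_MIN))


def mean():
    """E[L] for ln L uniform on [LN_MIN, LN_MAX]."""
    return (math.exp(LN_MAX) - math.exp(LN_MIN)) / (LN_MAX - LN_MIN)


def sample(rng):
    """Draw one lifetime by inverse-transform sampling."""
    return quantile(rng.random())


if __name__ == "__main__":
    rng = random.Random(1)
    print("median (min):", quantile(0.5) / 60)
    print("mean (min):  ", mean() / 60)
    print("five samples (s):", [round(sample(rng)) for _ in range(5)])
```

The large gap between median and mean reflects the long tail: a small fraction of very long jobs accounts for most of the total work.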
Figure 4: Distribution of cluster sizes for the workload from SDSC. The gray line is the distribution
we used for our workload model.

2.2.2  Average  parallelism 

For  our  workload  model,  we  would  like  to  know  the  parallelism  profile  of  the  jobs  in  the  workload. 
But the parallelism profile reflects potential parallelism, as if there were an unbounded number of
processors  available,  and  in  general  it  is  not  possible  to  derive  this  information  by  observing  the 
execution  of  the  program. 

In  the  accounting  data  we  have  from  SDSC,  we  do  not  have  information  about  the  average 
parallelism  of  jobs.  On  the  other  hand,  we  do  know  the  cluster  size  the  user  chose  for  each  job,  and 
we hypothesize that these cluster sizes, in the aggregate, reflect the parallelism of the workload.

Figure  4  shows  the  distribution  of  cluster  sizes  for  the  workload  from  SDSC.  Almost  all  jobs 
have  cluster  sizes  that  are  powers  of  two.  The  Paragon  does  not  require  power-of-two  cluster  sizes, 
but  the  interface  to  the  queueing  system  suggests  powers  of  two  and  few  users  have  any  incentive 
to  resist  the  combination  of  suggestion  and  habit.  We  hypothesize  that  the  step-wise  pattern  in 
the distribution of cluster sizes reflects this habit and not the true distribution of A.

Thus  for  our  workload  model,  the  distribution  of  A  is  uniform-log  with  parameters  min  =  1  and 
max = 128. The gray line in Figure 4 shows this distribution. At present, observed distributions
contain more sequential jobs than our model, but this surplus is disappearing as the workload
matures.  In  the  first  two  months  of  1996,  the  fraction  of  sequential  jobs  at  SDSC  was  21%,  down 
from  over  40%  a  year  prior.  Thus  more  recent  workloads  match  our  model  well. 

2.2.3  Variance  in  parallelism 

In general there is no way to measure the variance in potential parallelism of existing codes
explicitly. In [5] we propose a way to infer this value from observed speedup curves. To test this
technique, we collected speedup curves reported for a variety of scientific applications running on a
variety of parallel computers. We found that the parameter $\sigma$, which approximates the squared
coefficient of variation of parallelism, was typically in the range 0-2, with occasional higher values.

Although these observations provide a range of values for $\sigma$, they do not tell us its distribution
in a real workload. For this study, we use a uniform distribution between 0 and 2.


3  Predicting  queue  times 

In  [7]  we  present  statistical  techniques  for  predicting  the  remaining  queue  time  for  a  job  at  the 
head  of  the  queue.  We  summarize  these  techniques  here. 

We describe the state of the machine at the time of an arrival as follows: there are p jobs
running, with ages $a_i$ and cluster sizes $n_i$ (in other words, the ith job has been running on $n_i$
processors for $a_i$ seconds). We would like to predict $Q(n')$, the time until $n'$ additional processors
become available, where $n' = n - n_{free}$ and $n_{free}$ is the number of processors already available.
In the next two sections we present ways to estimate the median and mean of $Q(n')$.


3.1  Predictor  A:  median 

We can calculate the median of $Q(n')$ exactly by enumerating all possible outcomes (which jobs
complete  and  which  are  still  running),  and  calculating  the  probability  that  the  request  will  be 
satisfied  before  a  given  time  t.  Then  we  set  this  probability  equal  to  0.5  and  solve  for  the  median 
queue  time.  This  approach  is  not  feasible  when  there  are  many  jobs  in  the  system,  but  it  leads 
to  an  approximation  that  is  fast  to  compute  and  that  turns  out  to  be  sufficiently  accurate  for  our 
intended  purposes. 

We represent each outcome by a bit vector, b, where for each bit, $b_i = 0$ indicates that the ith
job is still running, and $b_i = 1$ indicates that the ith job has completed before time t. Since we
assume  independence  between  jobs  in  the  system,  the  probability  of  a  given  outcome  is  the  product 
of  the  probabilities  of  each  event  (the  completion  or  non-completion  of  a  job).  The  probability  of 
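each event comes from the conditional distribution of lifetimes. For a uniform-log distribution of
lifetimes, the conditional distribution $\mathrm{cdf}_{L|a}$ satisfies

$$
1 - \mathrm{cdf}_{L|a}(t) = \Pr\{L > t \mid L > a\}
= \frac{1 - \mathrm{cdf}_L(t)}{1 - \mathrm{cdf}_L(a)}
= \frac{1 - \beta_0 - \beta_1 \ln t}{1 - \beta_0 - \beta_1 \ln a} \tag{5}
$$

and so the probability of a given outcome is

$$
\Pr\{b\} = \prod_{i \mid b_i = 1} \mathrm{cdf}_{L|a_i}(t) \cdot \prod_{i \mid b_i = 0} \left(1 - \mathrm{cdf}_{L|a_i}(t)\right) \tag{6}
$$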

For  a  given  outcome,  the  number  of  free  processors  is  the  sum  of  the  processors  freed  by  each 
job  that  completes: 


$$
F(b) = \sum_i b_i \cdot n_i \tag{7}
$$
Thus at time t, the probability that the number of free processors is at least the requested
cluster size is the sum of the probabilities of all the outcomes that satisfy the request:

$$
\Pr\{F \ge n'\} = \sum_{b \mid F(b) \ge n'} \Pr\{b\} \tag{8}
$$

Finally, we find the median value of $Q(n')$ by setting $\Pr\{F \ge n'\} = 0.5$ and solving for t.
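
The enumeration can be sketched in a few lines of Python. This is an illustration rather than the implementation from [7]: it assumes uniform-log lifetimes with bounds e^2 and e^12 seconds, interprets t in Equation 5 as total lifetime (so a job of age a finishes within x more seconds with probability cdf_{L|a}(a + x)), and solves for the median by bisection. All function names are hypothetical.

```python
import itertools
import math

LN_MIN, LN_MAX = 2.0, 12.0  # assumed uniform-log bounds (natural log of seconds)


def clamped_log(t):
    """ln t restricted to the support of the lifetime distribution."""
    if t <= math.exp(LN_MIN):
        return LN_MIN
    return min(math.log(t), LN_MAX)


def completion_prob(age, x):
    """Equation 5 rearranged: Pr{job finishes within x more seconds | age}."""
    ln_a = clamped_log(age)
    if ln_a >= LN_MAX:
        return 1.0
    return (clamped_log(age + x) - ln_a) / (LN_MAX - ln_a)


def prob_satisfied(jobs, need, x):
    """Equation 8: sum of Pr{b} over outcomes b that free at least `need` processors.

    jobs is a list of (age_in_seconds, cluster_size) pairs.
    """
    probs = [completion_prob(age, x) for age, _ in jobs]
    total = 0.0
    for b in itertools.product((0, 1), repeat=len(jobs)):
        freed = sum(bi * n for bi, (_, n) in zip(b, jobs))  # Equation 7
        if freed >= need:
            pr = 1.0
            for bi, c in zip(b, probs):
                pr *= c if bi else 1.0 - c                  # Equation 6
            total += pr
    return total


def median_queue_time(jobs, need, iterations=60):
    """Smallest x with Pr{F >= need} >= 0.5, or None if the request can never be met."""
    if sum(n for _, n in jobs) < need:
        return None
    lo, hi = 0.0, math.exp(LN_MAX)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if prob_satisfied(jobs, need, mid) >= 0.5:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    running = [(1800.0, 16), (600.0, 4), (7200.0, 4)]
    print(median_queue_time(running, need=8))
```

The loop over `itertools.product` visits 2^p outcomes, which is what makes the exact calculation impractical for large p.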

Of  course,  the  number  of  possible  outcomes  (and  thus  the  time  for  this  calculation)  increases 
exponentially  with  p,  the  number  of  running  jobs.  Thus  this  is  not  a  feasible  approach  when  there 
are many running jobs. But when the number of additional processors required ($n'$) is small, it is
often  the  case  that  there  are  several  jobs  running  in  the  system  that  will  single-handedly  satisfy 
the  request  when  they  complete.  In  this  case,  the  probability  that  the  request  will  be  satisfied  by 
time  t  is  dominated  by  the  probability  that  one  of  these  benefactors  will  complete  before  time  t. 

In other words, the chance that the queue time for $n'$ processors will exceed time t is
approximately equal to the probability that none of the benefactors will complete before t:

$$
\Pr\{F < n'\} \approx \prod_{i \mid n_i \ge n'} \left(1 - \mathrm{cdf}_{L|a_i}(t)\right) \tag{9}
$$

The  running  time  of  this  calculation  is  linear  in  p.  Of  course,  it  is  only  approximately  correct, 
since  it  ignores  the  possibility  that  several  small  jobs  might  complete  and  satisfy  the  request.  Thus, 
we  expect  this  predictor  to  be  inaccurate  when  there  are  many  small  jobs  running  in  the  system, 
few  of  which  can  single-handedly  handle  the  request.  The  next  section  presents  an  alternative 
predictor  that  we  expect  to  be  more  accurate  in  this  case. 
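
A sketch of the approximate Predictor A follows, under the same assumptions as before: uniform-log lifetimes with bounds e^2 and e^12 seconds, t in Equation 5 read as total lifetime, and bisection as the (assumed) way of solving for the median. The name `predictor_a` is illustrative.

```python
import math

LN_MIN, LN_MAX = 2.0, 12.0  # assumed uniform-log bounds (natural log of seconds)


def clamped_log(t):
    """ln t restricted to the support of the lifetime distribution."""
    if t <= math.exp(LN_MIN):
        return LN_MIN
    return min(math.log(t), LN_MAX)


def survival_prob(age, x):
    """Equation 5: Pr{job is still running x seconds from now | age}."""
    ln_a = clamped_log(age)
    if ln_a >= LN_MAX:
        return 0.0
    return (LN_MAX - clamped_log(age + x)) / (LN_MAX - ln_a)


def predictor_a(jobs, need, iterations=60):
    """Approximate median queue time using benefactors only (Equation 9).

    jobs is a list of (age_in_seconds, cluster_size) pairs. Returns None
    when no single running job can satisfy the request.
    """
    benefactor_ages = [age for age, n in jobs if n >= need]
    if not benefactor_ages:
        return None

    def prob_unsatisfied(x):
        # Product over benefactors of the probability that each is still running.
        return math.prod(survival_prob(age, x) for age in benefactor_ages)

    lo, hi = 0.0, math.exp(LN_MAX)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if prob_unsatisfied(mid) <= 0.5:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    running = [(1800.0, 16), (300.0, 32), (60.0, 2)]
    print(predictor_a(running, need=8))
```

Each evaluation of the product touches every running job once, so the cost per evaluation is linear in p, in contrast to the exponential enumeration.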

3.2  Predictor  B:  mean 

When  a  job  is  running,  we  know  that  at  some  time  in  the  future  it  will  complete  and  free  all  of 
its  processors.  Given  the  age  of  the  job,  we  can  use  the  conditional  distribution  (Equation  5)  to 
calculate  the  probability  that  it  will  have  completed  before  time  t. 

We  will  approximate  this  behavior  by  a  model  in  which  processors  are  a  continuous  (rather  than 
discrete)  resource  that  jobs  release  gradually  as  they  execute.  In  this  case,  we  imagine  that  the 
conditional  cumulative  distribution  indicates  what  fraction  of  a  job’s  processors  will  be  available 
at  time  t. 

For  example,  a  job  that  has  been  running  for  30  minutes  might  have  a  50%  chance  of  completing 
in  the  next  hour,  releasing  all  of  its  processors.  As  an  approximation  of  this  behavior,  we  predict 
that  the  job  will  (deterministically)  release  50%  of  its  processors  within  the  next  hour. 

Thus  we  predict  that  the  number  of  free  processors  at  time  t  will  be  the  sum  of  the  processors 
released  by  each  job: 


$$
F = \sum_i n_i \cdot \mathrm{cdf}_{L|a_i}(t) \tag{10}
$$

To estimate the expected (mean) queue time we set $F = n'$ and solve for t.
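
Predictor B can be sketched the same way. As before, the lifetime bounds (e^2 and e^12 seconds), the reading of t as total lifetime, the bisection solver, and the name `predictor_b` are assumptions made for illustration.

```python
import math

LN_MIN, LN_MAX = 2.0, 12.0  # assumed uniform-log bounds (natural log of seconds)


def clamped_log(t):
    """ln t restricted to the support of the lifetime distribution."""
    if t <= math.exp(LN_MIN):
        return LN_MIN
    return min(math.log(t), LN_MAX)


def completion_prob(age, x):
    """Conditional cdf: Pr{job finishes within x more seconds | age}."""
    ln_a = clamped_log(age)
    if ln_a >= LN_MAX:
        return 1.0
    return (clamped_log(age + x) - ln_a) / (LN_MAX - ln_a)


def predictor_b(jobs, need, iterations=60):
    """Queue time at which the expected number of free processors reaches `need`.

    jobs is a list of (age_in_seconds, cluster_size) pairs. Returns None
    if all running jobs together hold fewer than `need` processors.
    """
    if sum(n for _, n in jobs) < need:
        return None

    def expected_free(x):
        # Equation 10: each job "releases" a fraction of its processors.
        return sum(n * completion_prob(age, x) for age, n in jobs)

    lo, hi = 0.0, math.exp(LN_MAX)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_free(mid) >= need:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    running = [(1800.0, 4), (600.0, 4), (7200.0, 4), (120.0, 4)]
    print(predictor_b(running, need=8))
```

Unlike the benefactor approximation, this estimate is defined even when no single job is large enough to satisfy the request.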

3.3  Combining  the  predictors 

Since  we  expect  the  two  predictors  to  do  well  under  different  circumstances,  it  is  natural  to  use  each 
when  we  expect  it  to  be  most  accurate.  In  general,  we  expect  Predictor  A  to  do  well  when  there  are 
many  jobs  in  the  system  that  can  single-handedly  satisfy  the  request  (benefactors).  When  there  are 
few benefactors, we expect Predictor B to be better (especially since, if there are none, we cannot
calculate Predictor A at all). Thus, in our simulations, we use Predictor A when the number of
benefactors is 2 or more, and Predictor B otherwise. The particular value of this threshold does
not affect the accuracy of the combined predictor drastically.
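
The selection rule itself is a one-line test on the number of benefactors. In the sketch below the two predictors are passed in as functions, so the block makes no assumption about how they are implemented; `combined_predictor` and its parameter names are illustrative.

```python
def combined_predictor(jobs, need, predictor_a, predictor_b, threshold=2):
    """Pick a predictor by counting benefactors.

    jobs is a list of (age_in_seconds, cluster_size) pairs; a benefactor is a
    running job whose cluster size alone covers the `need` additional
    processors. predictor_a and predictor_b are callables taking (jobs, need).
    """
    benefactors = sum(1 for _, n in jobs if n >= need)
    if benefactors >= threshold:
        return predictor_a(jobs, need)
    return predictor_b(jobs, need)


if __name__ == "__main__":
    # Stand-in predictors that only report which branch was taken.
    def use_a(jobs, need):
        return "Predictor A"

    def use_b(jobs, need):
        return "Predictor B"

    running = [(1800.0, 16), (300.0, 32), (60.0, 2)]
    print(combined_predictor(running, 8, use_a, use_b))
    print(combined_predictor(running, 20, use_a, use_b))
```

Keeping the threshold as a parameter makes it easy to check the claim that its particular value has little effect on accuracy.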

4  Simulations 

To evaluate the benefits of using predicted queue times for processor allocation, we use the models
in the previous section to generate workloads, and then use a simulator to construct schedules for
each workload according to the proposed allocation strategies. We compare these schedules using
several summary statistics as performance metrics.

Our simulations try to capture the daily work cycle that has been observed in several
supercomputing environments (the Intel iPSC/860 at NASA Ames and the Paragon at SDSC [9]
[23]):

• In the early morning there are few arrivals, system utilization is at its lowest, and queue
lengths are short.

• During the day, the arrival rate exceeds the departure rate and jobs accumulate in queue.
System utilization is highest late in the day.

• In the evening, the arrival rate falls but the utilization stays high as the jobs in queue begin
execution.

To model these variations, we divide each simulated day into two 12-hour phases: during the
“day-time” phase, jobs arrive according to a Poisson process and either begin execution or join
the queue, depending on the state of the system and the current scheduling strategy. During the
“night-time” phase, no new jobs arrive, but the existing jobs continue to run until all queued jobs
have been scheduled.

We choose the day-time arrival rate in order to achieve a specified offered load, $\rho$. We define
the offered load as the total sequential load divided by the processing capacity of the system:
$\rho = \lambda \cdot E[L] / N$, where $\lambda$ is the arrival rate (in jobs per second), $E[L]$ is
the average sequential lifetime (271 minutes in our simulations), and $N$ is the number of processors
in the system (128 in our simulations). We observe that the number of jobs per day is between 160
(when $\rho = 0.5$) and 320 (when $\rho = 1.0$).
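As an illustration of the offered-load definition, the following minimal sketch solves for the arrival rate and draws one day-time phase of Poisson arrivals. It is not the simulator used for the results in this paper; the function names, the constant names, and the choice of Python's `random` module are assumptions made for the example.

```python
import random

SECONDS_PER_MINUTE = 60
DAY_PHASE_SECONDS = 12 * 60 * 60  # length of the "day-time" phase


def arrival_rate(rho, mean_seq_lifetime_s, n_procs):
    """Solve rho = lambda * E[L] / N for lambda (jobs per second)."""
    return rho * n_procs / mean_seq_lifetime_s


def day_time_arrivals(rate, rng, phase_s=DAY_PHASE_SECONDS):
    """Arrival times (seconds) of a Poisson process with the given rate."""
    times = []
    t = rng.expovariate(rate)  # exponential inter-arrival gaps
    while t < phase_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times


if __name__ == "__main__":
    # Values from the text: E[L] = 271 minutes, N = 128; rho = 0.75 is one setting.
    lam = arrival_rate(0.75, 271 * SECONDS_PER_MINUTE, 128)
    arrivals = day_time_arrivals(lam, random.Random(1))
    print(f"lambda = {lam:.5f} jobs/s, {len(arrivals)} arrivals in one day-time phase")
```

The sketch assumes that all arrivals fall inside the 12-hour day-time phase, as described above; the night-time phase contributes no arrivals.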

5  Results:  application-centric 

In this section, we simulate a commonly proposed, system-centric scheduling strategy and show
that this strategy often makes decisions that are contrary to the interests of individual jobs. We
examine how users might subvert such a system, and measure the potential benefit of doing so.

5.1  AVG:  incentive  for  subversion 

Our baseline strategy is AVG, which assigns free processors to queued jobs in first-come-first-served
order, giving each job no more than A processors, where A is the average parallelism of the job.
Several studies have shown that this strategy performs well for a range of workloads [21] [11] [15]
[22] [4] [6].


An important feature of this strategy is that it is work-conserving: if even one processor is free,
the job at the head of the queue will be forced to begin execution immediately. From the system’s
point of view, work-conservation is expected to yield high utilization, but from the application’s
point of view, it often makes decisions that are contrary to the interests of the users. At the least,
users will object; in many cases, they will subvert the system.

For example, a user might try to avoid long run times by imposing a minimum cluster size for
his jobs. It is probably necessary for a real system to provide such a mechanism, because many jobs
cannot run on small clusters due to memory constraints. The problem is that inflated jobs (ones
that wait in queue for more resources than they strictly require) decrease utilization by leaving
processors idle, increase queue times not only for themselves, but also for the other jobs in queue,
and may not even improve their own performance. As a result, the performance of the system
degrades.

In [6] we evaluate a simple policy in which users require that each job allocate at least 40% of its
requested cluster size. The overall result is a 24% increase in average turnaround time. Imposing
larger minimum cluster sizes degrades the performance of the system even more dramatically. This
result suggests that AVG will not perform as well in the presence of self-interested users as it does
in simulation.

5.2  OPT:  scheduling  with  perfect  prediction 

Our application-centric strategy uses queue time predictions to minimize the turnaround time of
each job. By evaluating each possibility, we find the value of n that minimizes $Q(n) + R(n)$, where
$Q(n)$ is the queue time until n processors are available, and $R(n)$ is the run time of the job on n
processors. As in AVG, n is restricted to be no greater than A. In this section we will assume that
$Q(n)$ is known deterministically by oracular prediction, and find the maximum benefit individual
jobs might gain. In the next section we will see how much of this benefit we can achieve with
realistic predictors.
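The exhaustive search over cluster sizes can be sketched as follows. This is a minimal illustration, not the simulator's code: the function name `best_cluster_size` and the example job (ideal speedup, a step-shaped queue time) are hypothetical.

```python
def best_cluster_size(queue_time, run_time, max_procs):
    """Return the n in 1..max_procs that minimizes queue_time(n) + run_time(n)."""
    return min(range(1, max_procs + 1),
               key=lambda n: queue_time(n) + run_time(n))


if __name__ == "__main__":
    # Hypothetical job: 3600 s of sequential work, ideal speedup, A = 8.
    def run_time(n):
        return 3600 / n

    # Hypothetical oracle: 4 processors are free now, 8 are free after 300 s.
    def queue_time(n):
        return 0 if n <= 4 else 300

    print(best_cluster_size(queue_time, run_time, 8))
```

In this example, running immediately on 4 processors takes 900 s, while waiting 300 s and then running on 8 processors takes 750 s in total, so the search selects n = 8.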

We ran AVG for 120 simulated days with the offered load, $\rho$, fixed at 0.75. For each of the
30421 jobs, we found the cluster size that would minimize its turnaround time. Many jobs (63%)
were allocated the maximum cluster size, A processors. We call these jobs spoiled, since they get
everything they want.

Of the remaining jobs, we identified two groups. Docile jobs are the ones that do what the
system wants them to do: if they are offered fewer than A processors, they run immediately on
the small cluster. Rebels are the jobs that would benefit by defying the system and waiting for a
larger cluster.

In our simulations, 38% of the non-spoiled jobs (14% of all jobs) should rebel. For each rebel,
we calculate the time savings, which is the difference between the job’s actual turnaround time
and its best possible turnaround time. The median savings for a rebel is 15 minutes; the average
is 1.6 hours. These are substantial savings, considering that the median duration for all jobs is 3
minutes and the average duration is 1.3 hours. We conclude that using queue time predictions for
processor allocation has great potential to reduce turnaround times.
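The three job categories and the savings metric can be expressed compactly. The sketch below is only one way to encode the definitions above; the function names and argument conventions are assumptions for illustration.

```python
def classify(offered, best, max_procs):
    """Label a job given the cluster size AVG offers and its optimal size.

    offered   -- processors AVG would allocate (at most max_procs)
    best      -- cluster size minimizing queue time plus run time
    max_procs -- the job's average parallelism, A
    """
    if offered >= max_procs:
        return "spoiled"   # already gets the maximum cluster size
    if best > offered:
        return "rebel"     # waiting for a larger cluster pays off
    return "docile"        # running now on the small cluster is best


def time_savings(actual_turnaround, best_turnaround):
    """Time a job would save by making the optimal choice."""
    return actual_turnaround - best_turnaround


if __name__ == "__main__":
    print(classify(offered=4, best=8, max_procs=8))   # a job that should wait
    print(time_savings(900.0, 750.0))
```

As a consistency check on the reported fractions: 37% of jobs are not spoiled, and 38% of those is about 14% of all jobs.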

5.3  PRED:  using  realistic  predictions 

Of course, scheduling is easy with the benefit of an oracle. We would like to know how much time
rebels will save using less accurate predictions about queue times. In this section we will compare
the performance of the optimal strategy to an implementable strategy called PRED. PRED is
the same as OPT except that instead of knowing $Q(n)$ deterministically, we estimate it using the
predictors described in Section 3.

Under PRED, a rebel may reconsider its decision after some time and, based on a new set of
predictions, decide to start running. Of course, recomputing predictions incurs overhead; thus, it
is not clear how often jobs should be prompted. In our system, jobs are prompted whenever a new
job arrives in queue and whenever the predicted queue time elapses.

Our metric of performance is total time savings divided by the total number of jobs. OPT
maximizes this time savings per job. We simulated PRED with the same workload, job-for-job,
we used to simulate OPT (120 days, 30421 jobs, $\rho = 0.75$). The average time savings per job was
12.8 minutes, compared to the optimal 13.8 minutes. Thus, from the point of view of individual
applications, PRED is within 7% of optimal.

Of the rebels, only a few (8%) end up taking longer than they would if they were docile, and for
these the time lost is small (3.7 minutes on average). For the majority, the benefit is substantial;
the median time savings is 55 minutes; the average is 2.7 hours^.

5.4  BIAS:  bias-corrected  prediction 

Each time a simulated job uses a prediction to make an allocation decision, we record the
prediction and the outcome. Figure 5a shows a scatterplot of these predicted and actual queue
times. We measure the quality of the predictions by two metrics, accuracy and bias. Accuracy is
the tendency of the predictions and outcomes to be correlated; the coefficient of correlation (CC)
of the values in Figure 5a is 0.48 (all statistical calculations are performed under a logarithmic
transformation).

Bias is the tendency of the predictions to be consistently too high or too low. The lines in
the figure, which track the mean and median of each column, show that short predictions (under
ten minutes) are unbiased, but that longer predictions have a strong tendency to be too high. We
can quantify this bias by fitting a least-squares line to the data. For a perfect predictor, the slope
would be 1 and the intercept 0; for our predictors the slope of this line is 0.6 and the intercept 1.7.

Fortunately, if we know that a predictor is biased, and we estimate the parameters of this
bias (based on previous predictions), we can correct the bias by applying a transformation to the
calculated values. In this case, we estimate the intercept ($\beta_0$) and slope ($\beta_1$) of the trend line,
and apply the transformation $P_{corr} = P \cdot \beta_1 + \beta_0$, where $P$ is the calculated prediction and
$P_{corr}$ is the modified prediction. Figure 5b shows the effect of running the simulator again using this
transformation. The slope of the new trend line is 1.01 and the intercept is -0.01, indicating that
we have almost completely eliminated the bias.
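A minimal sketch of this correction is shown below, assuming that the trend line is fitted by ordinary least squares to (log predicted, log actual) pairs and that the logarithm is the natural log; the text does not state the base. The function names and the sample history are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ b0 + b1 * xs; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def correct(prediction_s, b0, b1):
    """Apply P_corr = P * b1 + b0 in log space (assumed natural log)."""
    return math.exp(math.log(prediction_s) * b1 + b0)


if __name__ == "__main__":
    # Hypothetical history of (predicted, actual) queue times in seconds.
    history = [(60, 70), (600, 400), (3600, 1500), (36000, 6000)]
    log_pred = [math.log(p) for p, _ in history]
    log_actual = [math.log(a) for _, a in history]
    b0, b1 = fit_line(log_pred, log_actual)
    print(f"slope {b1:.2f}, intercept {b0:.2f}")
    print(f"raw 3600 s -> corrected {correct(3600, b0, b1):.0f} s")
```

In a running system the pairs would be accumulated as predictions resolve, and the two parameters re-estimated periodically.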

Although we expected to be able to correct bias, we did not expect this transformation to
improve the accuracy of the predictions; the coefficient of correlation should be invariant under
an affine transformation. Surprisingly, bias correction raises CC from 0.48 to 0.59. This effect
is possible because past predictions influence system state, which influences future predictions;
thus the two scatterplots do not represent the same set of predictions. But we do not know why
unbiased predictions in the past lead to more accurate predictions in the future.
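The invariance claim can be checked directly. For an affine transformation $X \mapsto aX + b$ applied to the (log) predictions $X$, with the (log) outcomes $Y$ held fixed:

```latex
\mathrm{CC}(aX + b,\, Y)
  = \frac{\mathrm{Cov}(aX + b,\, Y)}{\sigma_{aX+b}\,\sigma_Y}
  = \frac{a\,\mathrm{Cov}(X, Y)}{|a|\,\sigma_X\,\sigma_Y}
  = \mathrm{sign}(a)\,\mathrm{CC}(X, Y)
```

Because the fitted slope is positive, correcting a fixed set of predictions would leave CC unchanged. A change in CC therefore has to come from the predictions themselves being different in the second run.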

The improvement in bias and accuracy is reflected in greater time savings. Under BIAS (PRED
with bias-corrected prediction) the average time savings per job increases from 12.8 minutes to
13.5 minutes, within 2% of optimal. In practice, the disadvantage of BIAS is that it requires us to
record the result of past predictions and estimate the parameters $\beta_0$ and $\beta_1$ dynamically.

^PRED’s  average  time  savings  per  rebel  is  actually  better  than  that  of  OPT,  since  PRED  chooses  fewer,  but 
more  successful,  rebels  (see  Table  1).  This  anomaly  is  the  reason  we  chose  time  savings  per  job  as  our  metric. 

[Figure 5: two scatterplots of actual vs. predicted queue times, with axes labeled "predicted" and
"actual". Panel (a): raw predictors. Panel (b): predictors with bias correction.]

Figure 5: Scatterplot of predicted and actual queue times (log scale). The white line shows the
identity function, i.e., a perfect predictor. The solid line shows the average of the actual queue
times in each column. The broken line shows the median.

5.5  HEUR:  using  only  application  characteristics

We  would  like  to  know  whether  our  predictions  are  really  necessary  for  allocation,  or  whether  we 
can  do  as  well  using  only  application  characteristics.  Intuitively,  we  expect  that  we  can  identify 
good  candidates  for  rebellion  by  their  long  lifetimes  and  high  degree  of  parallelism.  Observation 
of  the  optimal  policy  confirms  this  intuition;  the  average  lifetime  of  rebels  is  8  times  the  average 
lifetime  of  docile  jobs. 

We propose the following heuristic allocation policy (HEUR): any job with sequential lifetime
greater than $L_{thresh}$ and parallelism greater than $A_{thresh}$ will hold out for at least some fraction, $f$,
of its maximum cluster size, A. The parameters $A_{thresh}$, $L_{thresh}$, and $f$ must be tuned according
to system and workload characteristics.

Like PRED, HEUR requires a form of prompting. In this case, if a job rebels, we calculate its
potential time savings, $t_{save} = R(n_{free}) - R(fA)$, where $n_{free}$ is the number of free processors,
and $fA$ is the number of processors the job is holding out for. If a period of time, $k \, t_{save}$, elapses
before the job begins execution, the job is forced to run on the available processors. The parameter
$k$, again, must be tuned according to the workload.
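The HEUR rule and its time-out can be sketched as a single decision function. This is an illustrative reading of the policy, not the simulator's code: the function name, the argument layout, and the rounding of $fA$ up to a whole number of processors are assumptions.

```python
import math


def heur_decision(lifetime, parallelism, run_time, n_free,
                  l_thresh, a_thresh, f, k):
    """Return (target, max_wait) if the job rebels under HEUR, else None.

    run_time -- function mapping a cluster size to the job's run time
    target   -- processors the job holds out for (f * A, rounded up; assumed)
    max_wait -- k * t_save, after which the job must run on what is free
    """
    if lifetime <= l_thresh or parallelism <= a_thresh:
        return None
    target = math.ceil(f * parallelism)
    if n_free >= target:
        return None  # enough processors are already free; no need to wait
    t_save = run_time(n_free) - run_time(target)
    return target, k * t_save


if __name__ == "__main__":
    # Hypothetical job: 3600 s sequential lifetime, ideal speedup, A = 8.
    decision = heur_decision(3600, 8, lambda n: 3600 / n, n_free=4,
                             l_thresh=0, a_thresh=1.0, f=1.0, k=0.2)
    print(decision)
```

In this example the job holds out for 8 processors; its potential savings are 900 - 450 = 450 s, so with k = 0.2 it gives up after about 90 s of waiting.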

By a semi-systematic exploration of parameter space, we found that the following values
maximized the performance of HEUR: $L_{thresh} = 0$, $A_{thresh} = f = 1.0$, and $k = 0.2$. In other words,
all jobs, regardless of their characteristics, should hold out for at least A processors, but they
should not hold out for long. If they use up 20% of their potential time savings waiting in line,
they should concede and start running on the available processors.

Using  these  parameters,  the  average  time  saving  per  job  is  12.8  minutes,  which  is  the  same  as 
PRED.  Thus  we  conclude  that  a  well-tuned  heuristic  allocation  strategy  can  do  as  well,  from  the 
application-centric  view,  as  our  predictive  strategy.  In  a  real  system,  it  may  be  difficult  to  tune  the 
four  parameters;  we  have  observed  that  their  optimal  values  are  different  for  other  loads,  system 
sizes  and  workload  models. 

5.6  Summary  of  allocation  strategies 


Table 1: Summary of allocation strategies

            Number of rebels    Average time savings    Fraction of rebels
Strategy    (% of total)        per job (minutes)       who lose
OPT         4244 (14%)          13.8                    0%
PRED        2388 (8%)           12.8                    8%
BIAS        3089 (10%)          13.5                    8%
HEUR        11158 (37%)         12.8                    68%

Table 1 summarizes the performance of the four allocation strategies. PRED and BIAS are
more conservative than OPT; that is, they choose fewer rebellious jobs. PRED’s conservatism
is clearly a consequence of the tendency of our predictions to be too long. By overestimating
queue times, we discourage jobs from rebelling. But it is not as clear why BIAS, which does not
overestimate, is more conservative than OPT. In any case, both prediction-based strategies do a
good job of selecting successful rebels; only 8% of rebels end up spending more time in queue
than they save in run time.

On the other hand, HEUR is much more aggressive than OPT. More than a third of the jobs
rebel, although only for a short time. Of these, the majority (68%) end up wasting time in queue.
Since these losses are small, and the occasional win is big, the average over all jobs is the same as
PRED, but it is not clear whether users would choose a strategy with such a high chance of being
detrimental.


6  Results:  system-centric 

Until now, we have been considering the effect of allocation strategies on individual jobs. Thus in
our simulations we have not allowed jobs to carry out their allocation decisions; we have only measured
what would happen if they had. Furthermore, when we tuned these strategies, we chose parameters
that were best for individual jobs.

In this section we modify our simulations to implement the proposed strategies and evaluate
their effect on the performance of the system as a whole. We use two metrics of system performance:
average turnaround time (over all jobs) and utilization. We define utilization as the average, over
time and processors, of efficiency. Efficiency is the ratio of speedup to cluster size, $S(n)/n$. In our
simulations, we can calculate efficiencies because we know the speedup curves for each job. In real
systems this information is not usually available.
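One way to compute this utilization metric from a finished schedule is sketched below. The function name and the tuple layout are hypothetical, and the sketch assumes that idle processor-time counts as efficiency 0, which the definition above implies but does not state explicitly.

```python
def utilization(schedule, n_procs, horizon):
    """Average efficiency over all processors and the whole time horizon.

    schedule -- iterable of (cluster_size, duration, speedup), one per job
    n_procs  -- number of processors in the system, N
    horizon  -- length of the observation period (same unit as duration)
    Idle processor-time is assumed to contribute efficiency 0.
    """
    useful = 0.0
    for n, duration, speedup in schedule:
        efficiency = speedup / n             # S(n) / n
        useful += efficiency * n * duration  # weight by processor-time used
    return useful / (n_procs * horizon)


if __name__ == "__main__":
    # Hypothetical schedule on 128 processors over 100 s:
    # one job on 64 processors (speedup 48) for 100 s,
    # one job on 32 processors (speedup 32) for 50 s.
    print(utilization([(64, 100, 48), (32, 50, 32)], 128, 100))
```

In this example the two jobs contribute 4800 and 1600 efficiency-weighted processor-seconds out of 12800 available, giving a utilization of 0.5.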

Table 2 shows the results of each allocation strategy, using the same workload as in the previous
section. In each case, we compare the results with the baseline allocation strategy, AVG. Since
the workloads are identical, we can compare them job-by-job and see which jobs do better under
which strategies.


Table 2: System performance

            Average change in          Average turnaround time
Strategy    utilization (120 days)     in seconds (30421 jobs)
AVG         (baseline)                 4795
OPT         +0.1%                      5236 (+9%)
PRED        +2.4%                      4652 (-3%)
BIAS        +1.0%                      5048 (+5%)
HEUR        -5%                        6555 (+37%)

The predictive strategies all yield better utilization than AVG. This is surprising, since these
strategies often leave processors idle (which decreases utilization) and allocate larger clusters (which
decreases efficiency). Thus we expected these strategies to decrease overall utilization.

The reason they do not is that these strategies are better able to avoid L-shaped schedules.
Figure 6 shows two schedules for the same pair of jobs. Under AVG, the second arrival would
be forced to run immediately on the small cluster, which improves utilization in the short term
by reducing the number of idle processors. But after the first job quits, many processors are left
idle until the next arrival. Our predictive strategies allow the second job to wait for a larger
cluster, which not only reduces the turnaround time of the second job; it also increases the average
utilization of the system.

[Figure 6: two schedule diagrams with time on the horizontal axis. "Schedule under AVG": a running
job beside a small cluster with a long run time, followed by many idle processors. "Schedule with
prediction": a running job, a wait, then a larger cluster, with fewer idle processors.]

Figure 6: Sample schedules showing how longer queue times and larger cluster sizes can,
paradoxically, improve system utilization. Queue time prediction makes it possible to avoid L-shaped
schedules and thereby reduce the number of idle processors.


Despite this improvement, both OPT and BIAS lead to longer turnaround times, on average,
than AVG. The reason for this degradation is that rebellious jobs impose longer queue times on
the other jobs in queue. By using conservative predictions, PRED reduces the number of rebels
enough that the average turnaround time is actually better than under AVG. HEUR is much more
aggressive in its selection of rebels, and the system pays for it. The average turnaround time under
HEUR is 37% longer than under AVG.

Since  PRED  produces  the  best  results  for  the  system,  it  is  tempting  to  say  that  it  is  the  best 
choice  for  a  real  system.  Unfortunately,  it  does  not  satisfy  our  goal,  which  is  to  find  an  allocation 
strategy  that  is  robust  in  the  presence  of  self-interested  users.  In  the  case  of  PRED,  users  would 
eventually  notice  that  the  predicted  queue  times  were  consistently  too  high;  thus,  they  might  apply 
bias  correction  on  their  own  behalf.  The  result  would  be  performance  similar  to  BIAS. 

6.1  Improvements 

One way to reduce the impact of rebellion on global performance is to internalize the cost rebels
impose on the jobs behind them. Instead of minimizing the turnaround time of each job, $Q(n) + R(n)$,
the system might minimize the sum of run time and the queue time imposed on all jobs,
$q \, Q(n) + R(n)$, where $q$ is the number of jobs in queue.
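The effect of weighting queue time by the queue length can be seen in a small sketch. The function name and the example job are hypothetical, and the sketch assumes that the count q includes the job making the decision, so that q = 1 reduces to the individual objective.

```python
def system_aware_size(queue_time, run_time, max_procs, jobs_in_queue):
    """Return the n in 1..max_procs minimizing q * Q(n) + R(n)."""
    return min(range(1, max_procs + 1),
               key=lambda n: jobs_in_queue * queue_time(n) + run_time(n))


if __name__ == "__main__":
    # Hypothetical job: 3600 s of sequential work, ideal speedup, A = 8.
    def run_time(n):
        return 3600 / n

    # Hypothetical prediction: 4 processors free now, 8 free after 300 s.
    def queue_time(n):
        return 0 if n <= 4 else 300

    for q in (1, 2):
        print(q, system_aware_size(queue_time, run_time, 8, q))
```

With one job in queue the job still waits for 8 processors (300 + 450 = 750 s against 900 s), but with two jobs in queue the imposed delay doubles (600 + 450 = 1050 s) and running immediately on 4 processors becomes the better choice.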

The  problem  with  this  approach  is  that  it  takes  us  back  where  we  started;  some  users  will  be 
forced  to  accept  allocations  that  are  contrary  to  their  immediate  interests,  and  these  users  will 
have  incentive  to  subvert  the  system. 

An alternative, which can be enforced by the system, is lock-in. If a job chooses to rebel, it
must declare the number of processors it is holding out for; that is, the value of n that minimizes
the rebel’s expected turnaround time. Then, even if more processors become available, the rebel
can allocate only n. We expect this mechanism to reduce the queue time rebels impose on other
jobs, and thereby improve average turnaround time. In our simulations, lock-in reduces the average
turnaround time under BIAS from 5048 seconds to 5011 seconds (still 4.5% longer than AVG).

Another  way  to  mitigate  the  impact  of  rebellion  is  to  discourage  jobs  from  holding  out  for  short 
gains.  We  simulate  a  strategy  in  which  a  job  only  rebels  if  its  predicted  time  savings  are  greater 
than  20%  of  its  run  time.  Although  this  strategy  does  not  strictly  minimize  expected  turnaround 
time,  it  would  be  preferable  to  users  who  are  risk-averse.  Adding  risk  aversion  to  BIAS  with  lock-in 
decreases  the  average  turnaround  time  to  4783  seconds  (marginally  lower  than  under  AVG). 
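The two mechanisms can be sketched as simple predicates. The function names are hypothetical, and the sketch assumes that "its run time" refers to the run time on the processors that are free now; the text does not specify which run time is meant.

```python
def should_rebel(q_wait, r_now, r_target, min_gain=0.20):
    """Rebel only if predicted savings exceed min_gain of the run time.

    q_wait   -- predicted queue time for the larger cluster
    r_now    -- run time on the processors that are free now (assumed baseline)
    r_target -- run time on the larger cluster
    """
    predicted_savings = r_now - (q_wait + r_target)
    return predicted_savings > min_gain * r_now


def locked_in_allocation(declared, available):
    """Under lock-in a rebel gets exactly the size it declared, never more."""
    return declared if available >= declared else None


if __name__ == "__main__":
    # Hypothetical job: 900 s on the free processors, 450 s on the larger cluster.
    print(should_rebel(q_wait=300, r_now=900, r_target=450))
    print(should_rebel(q_wait=100, r_now=900, r_target=450))
    print(locked_in_allocation(declared=8, available=12))
```

In the first call the predicted savings (150 s) fall below 20% of the run time (180 s), so the job runs immediately; in the second the savings (350 s) clear the threshold and the job waits.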

Thus,  with  some  modification,  BIAS  satisfies  our  design  criteria:  it  allows  users  to  maximize 
the  performance  of  their  jobs  using  all  the  information  at  their  disposal,  yet  it  prevents  this  local 
optimization  from  degrading  the  overall  performance  of  the  system. 

7  Conclusions 

•  We have proposed an allocation policy that never creates the incentive for users to subvert the
system; thus, we expect it to perform well in the presence of self-interested users. We show
that this application-centric policy yields global performance as good as the best
system-centric policy.

•  Using  predicted  queue  times  to  choose  cluster  sizes  significantly  reduces  the  turnaround  time 
of  individual  jobs,  even  if  the  prediction  is  not  very  accurate.  In  our  simulations,  the  average 
time  savings  per  job  is  13.5  minutes  (the  average  job  duration  is  78  minutes).  From  the  point 
of  view  of  individual  applications,  our  predictive  allocation  strategy  is  within  2%  of  optimal. 

•  We  show  that  it  is  possible  to  improve  the  performance  of  some  jobs  using  only  application 
characteristics  and  ignoring  the  state  of  the  system,  but  we  observe  that  this  strategy  has  a 
disastrous  effect  on  the  overall  performance  of  the  system. 

7.1  Future  work 

In this paper we have considered a single system size (128 processors), distribution of application
characteristics (see Section 2), and load ($\rho = 0.75$). We would like to evaluate the effect of each of
these parameters on our results.

Also, we have modeled an environment in which users provide no information to the system
about the run times of their jobs. As a result, our predictions are not very accurate. In the real
systems we have examined, the information provided by users significantly improves the quality of
the predictions [7]. We would like to investigate the effect of this improvement on our results.

As  part  of  the  DOCT  project  [14]  we  are  in  the  process  of  implementing  system  agents  that 
provide  predicted  queue  times  on  space-sharing  parallel  machines.  Users  can  take  advantage  of  this 
information  to  choose  what  jobs  to  run,  when  to  run  them,  and  how  many  processors  to  allocate 
for  each.  We  expect  that  this  information  will  improve  user  satisfaction  with  these  systems,  and 
hope  that,  as  in  our  simulations,  it  will  lead  to  improvement  in  the  overall  performance  of  the 
system. 


References 


[1]  M. J. Atallah, C. L. Black, D. C. Marinescu, H. J. Siegel, and T. L. Casavant. Models and
algorithms for coscheduling compute-intensive tasks on a network of workstations. Journal of
Parallel and Distributed Computing, 16(4):319-327, Dec 1992.

[2]  Francine Berman and Rich Wolski, Principal Investigators. AppLeS Home Page
http://www-cse.ucsd.edu/groups/hpcl/apples/apples.html. University of California at San Diego, 1996.

[3]  Francine Berman and Rich Wolski. Scheduling from the perspective of the application. In
Proceedings of the High Performance Distributed Computing Conference, August 1996.

[4]  Su-Hui Chiang, Rajesh K. Mansharamani, and Mary K. Vernon. Use of application
characteristics and limited preemption for run-to-completion parallel processor scheduling policies. In
Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer
Systems, 1994.

[5]  Allen B. Downey. A model for speedup of parallel programs. In preparation, 1996.

[6]  Allen B. Downey. A parallel workload model and its implications for processor allocation.
Submitted for publication in SIGMETRICS’97, 1996.

[7]  Allen B. Downey. Predicting queue times on space-sharing parallel computers. In 11th
International Parallel Processing Symposium, April 1997. To appear. Also available as University
of California technical report number CSD-96-906.

[8]  Derek L. Eager, John Zahorjan, and Edward L. Lazowska. Speedup versus efficiency in parallel
systems. IEEE Transactions on Computers, 38(3):408-423, March 1989.

[9]  Dror G. Feitelson and Bill Nitzberg. Job characteristics of a production parallel scientific
workload on the NASA Ames iPSC/860. In IPPS ’95 Workshop on Job Scheduling Strategies
for Parallel Processing, pages 215-227, 1995.

[10]  J. Gehring and A. Reinefeld. MARS - a framework for minimizing the job execution time in
a metacomputing environment. In Future Generation Computer Systems (FGCS), 1996. To
appear.

[11]  Dipak Ghosal, Giuseppe Serazzi, and Satish K. Tripathi. The processor working set and
its use in scheduling multiprocessor systems. IEEE Transactions on Software Engineering,
17(5):443-453, May 1991.

[12]  Shikharesh Majumdar, Derek L. Eager, and Richard B. Bunt. Scheduling in multiprogrammed
parallel systems. In Proceedings of the ACM Sigmetrics Conference on Measurement and
Modeling of Computer Systems, pages 104-113, 1988.

[13]  Rajesh K. Mansharamani and Mary K. Vernon. Comparison of processor allocation policies
for parallel systems. Technical report, University of Wisconsin, December 1993.

[14]  Reagan Moore and Richard Klobuchar, Principal Investigators. DOCT Home Page
http://www.sdsc.edu/DOCT. San Diego Supercomputer Center, 1996.


[15]  Vijay K. Naik, Sanjeev K. Setia, and Mark S. Squillante. Performance analysis of job
scheduling policies in parallel supercomputing environments. In Supercomputing ’93 Conference
Proceedings, pages 824-833, March 1993.

[16]  Eric W. Parsons and Kenneth C. Sevcik. Coordinated allocation of memory and processors
in multiprocessors. In Proceedings of the ACM Sigmetrics Conference on Measurement and
Modeling of Computer Systems, pages 57-67, May 1996.

[17]  Emilia Rosti, Evgenia Smirni, Lawrence W. Dowdy, Giuseppe Serazzi, and Brian M. Carlson.
Robust partitioning policies of multiprocessor systems. Performance Evaluation, 19(2-3):141-165,
Mar 1994.

[18]  Emilia Rosti, Evgenia Smirni, Giuseppe Serazzi, and Lawrence W. Dowdy. Analysis of
non-work-conserving processor partitioning policies. In IPPS ’95 Workshop on Job Scheduling
Strategies for Parallel Processing, pages 101-111, 1995.

[19]  Sanjeev K. Setia and Satish K. Tripathi. An analysis of several processor partitioning policies
for parallel computers. Technical Report CS-TR-2684, University of Maryland, May 1991.

[20]  Sanjeev K. Setia and Satish K. Tripathi. A comparative analysis of static processor
partitioning policies for parallel computers. In Proceedings of the International Workshop on Modeling
and Simulation of Computer and Telecommunications Systems (MASCOTS), January 1993.

[21]  Kenneth C. Sevcik. Characterizations of parallelism in applications and their use in scheduling.
Performance Evaluation Review, 17(1):171-180, May 1989.

[22]  Evgenia Smirni, Emilia Rosti, Lawrence W. Dowdy, and Giuseppe Serazzi. Evaluation of
multiprocessor allocation policies. Technical report, Vanderbilt University, 1993.

[23]  Kurt Windisch, Virginia Lo, Dror Feitelson, Bill Nitzberg, and Reagan Moore. A comparison
of workload traces from two production parallel machines. In 6th Symposium on the Frontiers
of Massively Parallel Computation, 1996.




